Download Latin.unicharset along with radical-stroke.txt #219

Closed

Shreeshrii
Collaborator

Another PR will be needed to add Inherited.unicharset after tesseract-ocr/langdata_lstm#41 is merged.

@stweil
Collaborator

stweil commented Dec 17, 2020

All unicharset files for scripts are potentially needed, starting with Arabic.unicharset and ending with Thai.unicharset.

I usually fetch only the ones needed to satisfy the error message(s), but I still don't know what happens if they are missing.

@Shreeshrii
Collaborator Author

I added only the Latin and Inherited unicharsets to this list because they are required in almost all cases, even though their absence doesn't stop processing the way a missing radical-stroke.txt does.

We could add an optional variable, SCRIPT_UNICHARSET, and download that file when it is non-blank.

still don't know what happens if they are missing.

I think some characters, e.g. Arabic accents, get dropped from the unicharset generated by unicharset_extractor. That was the reason I built Inherited.unicharset.

Makefile Outdated
@@ -303,6 +303,8 @@ $(OUTPUT_DIR).traineddata: $(LAST_CHECKPOINT)
endif

$(DATA_DIR)/radical-stroke.txt:
# wget -O $(DATA_DIR)/Inherited.unicharset 'https://github.com/tesseract-ocr/langdata_lstm/raw/master/Inherited.unicharset'
wget -O $(DATA_DIR)/Latin.unicharset 'https://github.com/tesseract-ocr/langdata_lstm/raw/master/Latin.unicharset'

Collaborator

I'd put that in a separate Makefile target.

Collaborator Author

Inherited.unicharset is NOT in the langdata_lstm repo. I created it by copying the lines marked Inherited from other unicharsets. However, the coordinates for the same character differ between unicharsets, so I am not sure which ones should be used.
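
The copying step described here, and the coordinate mismatch it runs into, can be illustrated with a small shell sketch. It assumes the usual unicharset column order (character, properties, metrics, script, ...); the helper names `inherited_metrics` and `conflicting_chars` and the two sample files are hypothetical, and the sample values are made up.

```shell
#!/bin/sh
# Use the C locale so that sorting is byte-wise and predictable.
LC_ALL=C
export LC_ALL

# Hypothetical helper: print "character metrics" for every entry whose script
# field (4th column) is Inherited, across all files given, without duplicates.
inherited_metrics() {
    awk '$4 == "Inherited" { print $1, $3 }' "$@" | sort -u
}

# Hypothetical helper: list characters that appear with more than one
# distinct metrics string, i.e. the cases where a choice has to be made.
conflicting_chars() {
    inherited_metrics "$@" \
        | awk '{ n[$1]++ } END { for (c in n) if (n[c] > 1) print c }' \
        | sort
}

# Two tiny hand-written samples; "x" and "y" stand in for real combining marks.
dir=$(mktemp -d)
cat > "$dir/A.unicharset" <<'EOF'
3
x 0 0,255,0,255,0,0,0,0,0,0 Inherited 0 17 0 x
y 0 10,20,30,40,0,0,0,0,0,0 Inherited 1 17 1 y
a 3 0,255,0,255,0,0,0,0,0,0 Latin 2 0 2 a
EOF
cat > "$dir/B.unicharset" <<'EOF'
2
x 0 0,255,0,255,0,0,0,0,0,0 Inherited 0 17 0 x
y 0 11,22,33,44,0,0,0,0,0,0 Inherited 1 17 1 y
EOF

inherited_metrics "$dir"/*.unicharset
echo "conflicts: $(conflicting_chars "$dir"/*.unicharset)"
```

Only the character and its metrics are compared, because the numeric id columns are local to each file and would differ even for otherwise identical entries.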

Hi, how can I get Inherited.unicharset?

@stweil
Collaborator

stweil commented Jan 14, 2021

A list of all required *.unicharset files can be extracted from unicharset:

sed s/.*0,0,0.// $(OUTPUT_DIR)/unicharset | sed 's/ .*//' | sort | uniq | grep "^[A-Z][a-z][a-z]*" | grep -v common
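
To see what this pipeline produces, here is a small shell sketch that runs it on a hand-written sample. The sample lines imitate the layout written by unicharset_extractor and are an assumption, as is the helper name `list_scripts`. Note that `grep -v common` is case-sensitive, so `Common` stays in the output.

```shell
#!/bin/sh
# Force the C locale so that [A-Z] and [a-z] match ASCII letters only.
LC_ALL=C
export LC_ALL

# Hypothetical helper (not part of the Makefile): list the script names
# referenced in a unicharset file, using the pipeline quoted above.
list_scripts() {
    sed 's/.*0,0,0.//' "$1" | sed 's/ .*//' | sort | uniq \
        | grep '^[A-Z][a-z][a-z]*' | grep -v common
}

# Hand-written sample; real files are much longer.
sample=$(mktemp)
cat > "$sample" <<'EOF'
4
NULL 0 Common 0
a 3 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 a
b 3 0,255,0,255,0,0,0,0,0,0 Latin 2 0 2 b
. 10 0,255,0,255,0,0,0,0,0,0 Common 3 6 3 .
EOF

# Prints Common and Latin, one per line.
list_scripts "$sample"
```

The first `sed` strips everything up to the end of the metrics field, the second keeps only the next word (the script name), and the `grep` on capitalised words discards the count line and the `NULL` entry.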

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 15, 2021

Thanks for the suggestions @stweil and the hint to get the list of required unicharsets from $(OUTPUT_DIR)/unicharset.

I am having a hard time putting it together in a separate Makefile target using the list. I would appreciate it if you could make the required change.

Here is what I have tried so far:

SCRIPT_NAMES := $(shell cat $(OUTPUT_DIR)/unicharset | sed s/.*0,0,0.// | sed 's/ .*//' | sort | uniq | grep "^[A-Z][a-z][a-z]*" | grep -v common | sed '/Common/d' | sed '/Inherited/d' | sed '/Joined/d')
SCRIPT_UNICHARSETS = $(foreach script,$(SCRIPT_NAMES),$(script).unicharset)
scriptunicharsets: $(SCRIPT_UNICHARSETS)
$(DATA_DIR)/%.unicharset:%.unicharset
	echo $@
	wget -O $@ 'https://github.com/tesseract-ocr/langdata/raw/master/$@'
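
One detail worth checking in the rule above: `$@` expands to the whole target path, including the `$(DATA_DIR)/` prefix, so the URL built from it would contain the directory as well as the file name (in make, `$(notdir $@)` yields the file name alone). The shell sketch below shows the intended URL construction as a dry run. The helper names and the `DRY_RUN` switch are hypothetical, and the base URL is the langdata_lstm one from the diff earlier in this PR; which repository actually hosts a given file should be verified.

```shell
#!/bin/sh
# Base URL taken from the wget line in the PR diff (an assumption for other files).
BASE_URL='https://github.com/tesseract-ocr/langdata_lstm/raw/master'

# Hypothetical helper: build the download URL for a target path such as
# data/Latin.unicharset, using only the file name part of the path.
unicharset_url() {
    printf '%s/%s\n' "$BASE_URL" "$(basename "$1")"
}

# Hypothetical helper: fetch one unicharset per script name into a directory.
# With DRY_RUN=1 the wget commands are printed instead of executed.
fetch_unicharsets() {
    data_dir=$1
    shift
    for script in "$@"; do
        target="$data_dir/$script.unicharset"
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "wget -O $target '$(unicharset_url "$target")'"
        else
            wget -O "$target" "$(unicharset_url "$target")"
        fi
    done
}

# Dry run: only print what would be downloaded.
DRY_RUN=1
fetch_unicharsets data Latin Arabic
```

The script names passed on the last line would come from the extracted list; here they are written out by hand.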

@wrznr
Collaborator

wrznr commented Jan 22, 2021

@kba Could you please have a look at the change request and maybe come up with a proposal?

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 22, 2021

I added sed '/Common/d' | sed '/Inherited/d' | sed '/Joined/d' to the command suggested by @stweil because there are no unicharsets for Common and Inherited. Joined was being picked up accidentally.

A simpler way may be to ask the user to specify a script and download only that one.

@Shreeshrii
Collaborator Author

A simpler way may be to ask the user to specify a script and download only that one.

I have tried that in the new Makefile-font2model.
I think that is a much cleaner way of doing this.

@Shreeshrii
Collaborator Author

Included as part of #230

@Shreeshrii Shreeshrii closed this Feb 2, 2021
@Shreeshrii Shreeshrii deleted the PR6 branch February 2, 2021 16:25
5 participants